The analysis relies on the following tools and techniques:

- `plotly.express` to create interactive visualizations that make the analysis easier to understand and more engaging.
- `sklearn.decomposition` for efficient dimensionality reduction, simplifying the models by extracting the most relevant features, which is vital for both performance and interpretability.
- `precision_recall_curve` to evaluate classifier performance by computing precision and recall at various probability thresholds, helping to optimize the trade-off between capturing positives and minimizing false positives.
- The silhouette score to evaluate clustering quality. It measures how similar each point is to its own cluster compared to other clusters, giving insight into cluster cohesion and separation; a higher score indicates better-defined clusters and more reliable results.
- `StandardScaler` to normalize the dataset before clustering. This preprocessing step ensures that each feature contributes equally to the distance calculations, improving both the clustering performance and the interpretability of the silhouette score.
- `AgglomerativeClustering` as our clustering algorithm. This hierarchical approach builds clusters by iteratively merging the closest pairs of clusters, creating a tree-like structure. Combined with the silhouette score and `StandardScaler` normalization, it yielded more accurate and meaningful clusters.
- The ROC curve, which plots the true positive rate against the false positive rate at various threshold settings, giving a comprehensive view of the model's ability to distinguish between classes.
- The area under the ROC curve (AUC), a single metric that summarizes the model's discriminative ability, with higher values indicating better performance.
- Principal component analysis (PCA) to reduce the dimensionality of our dataset in section E.

from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
import plotly.express as px
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import silhouette_score, mean_absolute_error, accuracy_score, classification_report, confusion_matrix, roc_curve, auc, precision_recall_curve
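The threshold-based metrics listed in the introduction (precision/recall curves, ROC, AUC) can be sketched on a toy example (the labels and scores below are hypothetical, for illustration only):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve, auc

# Hypothetical labels and classifier scores
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Precision/recall at every threshold implied by the scores
precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)

# ROC curve and the area under it
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
print(round(roc_auc, 2))  # 0.75
```

Here the one misranked pair (the positive scored 0.35 below the negative scored 0.4) is what pulls the AUC below 1.0.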
results = pd.read_csv('results.csv')
shootouts = pd.read_csv('shootouts.csv')
goalscorers = pd.read_csv('goalscorers.csv')
results.head()
| | date | home_team | away_team | home_score | away_score | tournament | city | country | neutral |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1872-11-30 | Scotland | England | 0.0 | 0.0 | Friendly | Glasgow | Scotland | False |
| 1 | 1873-03-08 | England | Scotland | 4.0 | 2.0 | Friendly | London | England | False |
| 2 | 1874-03-07 | Scotland | England | 2.0 | 1.0 | Friendly | Glasgow | Scotland | False |
| 3 | 1875-03-06 | England | Scotland | 2.0 | 2.0 | Friendly | London | England | False |
| 4 | 1876-03-04 | Scotland | England | 3.0 | 0.0 | Friendly | Glasgow | Scotland | False |
shootouts.head()
| | date | home_team | away_team | winner | first_shooter |
|---|---|---|---|---|---|
| 0 | 1967-08-22 | India | Taiwan | Taiwan | NaN |
| 1 | 1971-11-14 | South Korea | Vietnam Republic | South Korea | NaN |
| 2 | 1972-05-07 | South Korea | Iraq | Iraq | NaN |
| 3 | 1972-05-17 | Thailand | South Korea | South Korea | NaN |
| 4 | 1972-05-19 | Thailand | Cambodia | Thailand | NaN |
goalscorers.head()
| | date | home_team | away_team | team | scorer | minute | own_goal | penalty |
|---|---|---|---|---|---|---|---|---|
| 0 | 1916-07-02 | Chile | Uruguay | Uruguay | José Piendibene | 44.0 | False | False |
| 1 | 1916-07-02 | Chile | Uruguay | Uruguay | Isabelino Gradín | 55.0 | False | False |
| 2 | 1916-07-02 | Chile | Uruguay | Uruguay | Isabelino Gradín | 70.0 | False | False |
| 3 | 1916-07-02 | Chile | Uruguay | Uruguay | José Piendibene | 75.0 | False | False |
| 4 | 1916-07-06 | Argentina | Chile | Argentina | Alberto Ohaco | 2.0 | False | False |
# Determine the match outcome
def determine_outcome(row):
if row['home_score'] > row['away_score']:
return 'Home Win'
elif row['home_score'] < row['away_score']:
return 'Away Win'
else:
return 'Draw'
# Determine the match outcomes
outcomes = results.apply(determine_outcome, axis=1)
outcome_counts = outcomes.value_counts().reset_index()
outcome_counts.columns = ['Outcome', 'Count']
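The row-wise `apply` above can also be expressed with a vectorized `np.select`, which is typically much faster on large frames. A sketch on a toy frame (the example scores are hypothetical):

```python
import numpy as np
import pandas as pd

# Same outcome logic as determine_outcome, vectorized with np.select
df = pd.DataFrame({'home_score': [2, 0, 1], 'away_score': [1, 3, 1]})
conditions = [df['home_score'] > df['away_score'],
              df['home_score'] < df['away_score']]
outcomes = np.select(conditions, ['Home Win', 'Away Win'], default='Draw')
print(list(outcomes))  # ['Home Win', 'Away Win', 'Draw']
```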
# Bar plot
fig = px.bar(outcome_counts, x='Outcome', y='Count', title='Distribution of Match Outcomes',
labels={'Count': 'Number of Matches'},
hover_data={'Count': True},
color='Outcome',
color_discrete_map={'Home Win': 'green', 'Away Win': 'red', 'Draw': 'blue'})
fig.update_layout(title={'text': 'Match Outcomes', 'x':0.5, 'xanchor': 'center'})
# Plot display
fig.show()
# Convert the date column to datetime
results['date'] = pd.to_datetime(results['date'])
# Extract the year from the date column
results_year = results['date'].dt.year
# Calculate the average goals per year for home and away teams
average_goals_per_year = results.groupby(results_year)[['home_score', 'away_score']].mean().reset_index()
average_goals_per_year.columns = ['Year', 'Home Goals', 'Away Goals']
# Line plot for average goals
plt.figure(figsize=(12, 6))
plt.plot(average_goals_per_year['Year'], average_goals_per_year['Home Goals'], marker='o', linestyle='-', color='green', label='Home Goals')
plt.plot(average_goals_per_year['Year'], average_goals_per_year['Away Goals'], marker='o', linestyle='-', color='red', label='Away Goals')
plt.title('Average Goals Scored by Home and Away Teams Over Time')
plt.xlabel('Year')
plt.ylabel('Average Number of Goals')
plt.legend()
plt.grid(True)
# Plot display
plt.show()
# Ensure the date columns are of the same type
results['date'] = pd.to_datetime(results['date'])
shootouts['date'] = pd.to_datetime(shootouts['date'])
# Merge results and shootouts datasets
merged_data = results.merge(shootouts[['date', 'home_team', 'away_team']], on=['date', 'home_team', 'away_team'], how='left', indicator=True)
# Indicate if the match ended in a shootout
merged_data['shootout'] = merged_data['_merge'] == 'both'
# Calculate the average goals for matches with and without shootouts
average_goals_plotly = merged_data.groupby('shootout')[['home_score', 'away_score']].mean().sum(axis=1).reset_index()
average_goals_plotly.columns = ['Shootout', 'Average Goals']
average_goals_plotly['Shootout'] = average_goals_plotly['Shootout'].map({True: 'Shootout', False: 'No Shootout'})
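The merge with `indicator=True` is what drives the shootout flag: pandas adds a `_merge` column whose value is `'both'` only when the row matched in both frames. A minimal sketch on hypothetical data:

```python
import pandas as pd

# Two-row left frame; only the second row has a match on the right
left = pd.DataFrame({'date': ['2020-01-01', '2020-01-02'],
                     'home_team': ['A', 'B']})
right = pd.DataFrame({'date': ['2020-01-02'], 'home_team': ['B']})

merged = pd.merge(left, right, on=['date', 'home_team'],
                  how='left', indicator=True)
print(merged['_merge'].tolist())  # ['left_only', 'both']
```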
# Bar plot
fig = px.bar(average_goals_plotly, x='Shootout', y='Average Goals', title='Average Goals in Matches with and without Penalty Shootouts',
labels={'Average Goals': 'Average Number of Goals', 'Shootout': 'Match Type'},
hover_data={'Average Goals': True},
color='Shootout',
color_discrete_map={'Shootout': 'orange', 'No Shootout': 'blue'})
fig.update_layout(title={'text': 'Average Goals in Matches with and without Penalty Shootouts', 'x':0.5, 'xanchor': 'center'})
# Display plot
fig.show()
# Ensure the date column is in datetime format
shootouts['date'] = pd.to_datetime(shootouts['date'])
# Count the number of shootouts per year without adding a column
shootouts_per_year = shootouts['date'].dt.year.value_counts().reset_index()
shootouts_per_year.columns = ['Year', 'Number of Shootouts']
shootouts_per_year = shootouts_per_year.sort_values('Year')
# Create the plot
fig3_corrected = px.bar(shootouts_per_year, x='Year', y='Number of Shootouts', title='Number of Penalty Shootouts Over Years',
labels={'Number of Shootouts': 'Number of Shootouts', 'Year': 'Year'},
hover_data={'Number of Shootouts': True},
color='Number of Shootouts',
color_continuous_scale='Inferno')
fig3_corrected.update_layout(title={'text': 'Number of Penalty Shootouts Over Years', 'x':0.5, 'xanchor': 'center'})
# Show the plot
fig3_corrected.show()
# Convert the date column to datetime in the goalscorers dataset
goalscorers['date'] = pd.to_datetime(goalscorers['date'])
# Set the number of periods for the analysis
num_periods = 6
# Determine the minimum and maximum year in the dataset
min_year = goalscorers['date'].dt.year.min()
max_year = goalscorers['date'].dt.year.max()
# Calculate the length of each period
period_length = (max_year - min_year) // num_periods
# Define the periods as tuples of (start_year, end_year)
periods = [(min_year + i * period_length, min_year + (i + 1) * period_length - 1) for i in range(num_periods)]
periods[-1] = (periods[-1][0], max_year) # Adjust the last period to include the max_year
def calculate_goals_by_half_for_period(start_year, end_year):
# Filter the data for the given period
period_goals = goalscorers[(goalscorers['date'].dt.year >= start_year) & (goalscorers['date'].dt.year <= end_year)]
# Calculate goals scored in the first and second half
first_half_goals = period_goals[period_goals['minute'] <= 45].shape[0]
second_half_goals = period_goals[period_goals['minute'] > 45].shape[0]
return first_half_goals, second_half_goals
# Calculate the number of goals scored in each half for each period
goals_by_period = [calculate_goals_by_half_for_period(start_year, end_year) for start_year, end_year in periods]
# Define colors for the pie charts
colors = ['#00FF00', '#FF0000']
BLUE = '\033[94m'
BOLD = '\033[1m'
RESET = '\033[0m'
text = "Number of goals in the first half vs the second half"
formatted_text = f"{BOLD}{BLUE}{text.center(110)}{RESET}"
print(formatted_text)
plt.figure(figsize=(18, 12))
for i, (start_year, end_year) in enumerate(periods):
plt.subplot(2, 3, i + 1) # 2 rows, 3 columns
first_half_goals, second_half_goals = goals_by_period[i]
plt.pie([first_half_goals, second_half_goals], labels=['First Half', 'Second Half'], autopct='%1.1f%%', startangle=140, colors=colors, textprops={'fontsize': 14})
plt.title(f'Goals Scored by Half ({start_year}-{end_year})', fontsize=16, ha='center')
plt.tight_layout()
# Plot display
plt.show()
# Define the bins for dividing match minutes into periods
bins = [0, 15, 30, 45, 60, 75, 90]
# Define labels for each period
labels = ['0-15', '16-30', '31-45', '46-60', '61-75', '76-90']
# Categorize 'minute' column in goalscorers DataFrame into defined periods
goalscorers['period'] = pd.cut(goalscorers['minute'], bins=bins, labels=labels, right=False)
# Count the number of goals scored in each period and sort by period
goals_by_period = goalscorers['period'].value_counts().sort_index()
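One caveat of `pd.cut` with `right=False` is that the intervals are left-closed, which shifts boundary minutes relative to the labels; a quick sketch with hypothetical minutes:

```python
import pandas as pd

# With right=False the bins are [0, 15), [15, 30), ..., [75, 90):
# a goal at exactly minute 15 lands in the '16-30' bucket, and a goal
# at minute 90 or later falls outside the last bin and becomes NaN.
minutes = pd.Series([0, 14, 15, 45, 89, 90])
bins = [0, 15, 30, 45, 60, 75, 90]
labels = ['0-15', '16-30', '31-45', '46-60', '61-75', '76-90']
periods = pd.cut(minutes, bins=bins, labels=labels, right=False)
print(periods.tolist())
```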
# Create a figure for plot
plt.figure(figsize=(12, 6))
# Generate a color palette for the bars
colors = sns.color_palette("coolwarm", len(labels))
# Plot the distribution of goals by period as a bar chart
goals_by_period.plot(kind='bar', color=colors, edgecolor='black')
# Add title and labels to the plot
plt.title('Goals Distribution by Match Period', fontsize=16)
plt.xlabel('Match Period', fontsize=14)
plt.ylabel('Goals Scored', fontsize=14)
# Customize the x and y ticks
plt.xticks(rotation=0, fontsize=12)
plt.yticks(fontsize=12)
# Add a grid to the plot for better readability
plt.grid(True, linestyle='--', alpha=0.7)
# Annotate the bars with the goal counts
for index, value in enumerate(goals_by_period):
plt.text(index, value + 1, str(value), ha='center', fontsize=12)
# Adjust layout to prevent clipping of labels
plt.tight_layout()
# Plot display
plt.show()
# Adding "Home team won" feature
def determine_winner(row):
    return row['home_score'] > row['away_score']
results['Home_team_won'] = results.apply(determine_winner, axis=1)
results.head()
| | date | home_team | away_team | home_score | away_score | tournament | city | country | neutral | Home_team_won |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1872-11-30 | Scotland | England | 0.0 | 0.0 | Friendly | Glasgow | Scotland | False | False |
| 1 | 1873-03-08 | England | Scotland | 4.0 | 2.0 | Friendly | London | England | False | True |
| 2 | 1874-03-07 | Scotland | England | 2.0 | 1.0 | Friendly | Glasgow | Scotland | False | True |
| 3 | 1875-03-06 | England | Scotland | 2.0 | 2.0 | Friendly | London | England | False | False |
| 4 | 1876-03-04 | Scotland | England | 3.0 | 0.0 | Friendly | Glasgow | Scotland | False | True |
# Calculate the cumulative number of home wins and the running count of matches played up to each match
cumulative_home_wins = results['Home_team_won'].cumsum()
cumulative_total_home_matches = results.reset_index().index + 1
# Compute the overall home win rate (across all teams, not per team) as cumulative home wins over matches played so far
results['home_team_win_rate'] = cumulative_home_wins / cumulative_total_home_matches
results.head()
| | date | home_team | away_team | home_score | away_score | tournament | city | country | neutral | Home_team_won | home_team_win_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1872-11-30 | Scotland | England | 0.0 | 0.0 | Friendly | Glasgow | Scotland | False | False | 0.000000 |
| 1 | 1873-03-08 | England | Scotland | 4.0 | 2.0 | Friendly | London | England | False | True | 0.500000 |
| 2 | 1874-03-07 | Scotland | England | 2.0 | 1.0 | Friendly | Glasgow | Scotland | False | True | 0.666667 |
| 3 | 1875-03-06 | England | Scotland | 2.0 | 2.0 | Friendly | London | England | False | False | 0.500000 |
| 4 | 1876-03-04 | Scotland | England | 3.0 | 0.0 | Friendly | Glasgow | Scotland | False | True | 0.600000 |
# Add a column indicating whether the away team won the match
results['away_team_won'] = results['away_score'] > results['home_score']
# Calculate the cumulative number of away wins and the running count of matches played up to each match
cumulative_away_wins = results['away_team_won'].cumsum()
cumulative_total_away_matches = results.reset_index().index + 1
# Compute the overall away win rate (across all teams) as cumulative away wins over matches played so far
results['away_team_win_rate'] = cumulative_away_wins / cumulative_total_away_matches
# Drop the 'away_team_won' column
results = results.drop(columns=['away_team_won'])
results.head()
| | date | home_team | away_team | home_score | away_score | tournament | city | country | neutral | Home_team_won | home_team_win_rate | away_team_win_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1872-11-30 | Scotland | England | 0.0 | 0.0 | Friendly | Glasgow | Scotland | False | False | 0.000000 | 0.0 |
| 1 | 1873-03-08 | England | Scotland | 4.0 | 2.0 | Friendly | London | England | False | True | 0.500000 | 0.0 |
| 2 | 1874-03-07 | Scotland | England | 2.0 | 1.0 | Friendly | Glasgow | Scotland | False | True | 0.666667 | 0.0 |
| 3 | 1875-03-06 | England | Scotland | 2.0 | 2.0 | Friendly | London | England | False | False | 0.500000 | 0.0 |
| 4 | 1876-03-04 | Scotland | England | 3.0 | 0.0 | Friendly | Glasgow | Scotland | False | True | 0.600000 | 0.0 |
# Calculate the cumulative sum of goals scored by home teams up to each match
cumulative_home_goals = results['home_score'].cumsum()
# Compute the average home goals as cumulative home goals over the running match count
results['home_team_avg_goals'] = cumulative_home_goals / cumulative_total_home_matches
results.head()
| | date | home_team | away_team | home_score | away_score | tournament | city | country | neutral | Home_team_won | home_team_win_rate | away_team_win_rate | home_team_avg_goals |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1872-11-30 | Scotland | England | 0.0 | 0.0 | Friendly | Glasgow | Scotland | False | False | 0.000000 | 0.0 | 0.0 |
| 1 | 1873-03-08 | England | Scotland | 4.0 | 2.0 | Friendly | London | England | False | True | 0.500000 | 0.0 | 2.0 |
| 2 | 1874-03-07 | Scotland | England | 2.0 | 1.0 | Friendly | Glasgow | Scotland | False | True | 0.666667 | 0.0 | 2.0 |
| 3 | 1875-03-06 | England | Scotland | 2.0 | 2.0 | Friendly | London | England | False | False | 0.500000 | 0.0 | 2.0 |
| 4 | 1876-03-04 | Scotland | England | 3.0 | 0.0 | Friendly | Glasgow | Scotland | False | True | 0.600000 | 0.0 | 2.2 |
# Calculate the cumulative sum of goals scored by the away team up to each match
cumulative_away_goals = results['away_score'].cumsum()
# Compute the away team average goals as the ratio of cumulative away goals to cumulative total matches
results['away_team_avg_goals'] = cumulative_away_goals / cumulative_total_away_matches
results.head()
| | date | home_team | away_team | home_score | away_score | tournament | city | country | neutral | Home_team_won | home_team_win_rate | away_team_win_rate | home_team_avg_goals | away_team_avg_goals |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1872-11-30 | Scotland | England | 0.0 | 0.0 | Friendly | Glasgow | Scotland | False | False | 0.000000 | 0.0 | 0.0 | 0.00 |
| 1 | 1873-03-08 | England | Scotland | 4.0 | 2.0 | Friendly | London | England | False | True | 0.500000 | 0.0 | 2.0 | 1.00 |
| 2 | 1874-03-07 | Scotland | England | 2.0 | 1.0 | Friendly | Glasgow | Scotland | False | True | 0.666667 | 0.0 | 2.0 | 1.00 |
| 3 | 1875-03-06 | England | Scotland | 2.0 | 2.0 | Friendly | London | England | False | False | 0.500000 | 0.0 | 2.0 | 1.25 |
| 4 | 1876-03-04 | Scotland | England | 3.0 | 0.0 | Friendly | Glasgow | Scotland | False | True | 0.600000 | 0.0 | 2.2 | 1.00 |
results['goal_difference'] = results['home_score'] - results['away_score']
results.head()
| | date | home_team | away_team | home_score | away_score | tournament | city | country | neutral | Home_team_won | home_team_win_rate | away_team_win_rate | home_team_avg_goals | away_team_avg_goals | goal_difference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1872-11-30 | Scotland | England | 0.0 | 0.0 | Friendly | Glasgow | Scotland | False | False | 0.000000 | 0.0 | 0.0 | 0.00 | 0.0 |
| 1 | 1873-03-08 | England | Scotland | 4.0 | 2.0 | Friendly | London | England | False | True | 0.500000 | 0.0 | 2.0 | 1.00 | 2.0 |
| 2 | 1874-03-07 | Scotland | England | 2.0 | 1.0 | Friendly | Glasgow | Scotland | False | True | 0.666667 | 0.0 | 2.0 | 1.00 | 1.0 |
| 3 | 1875-03-06 | England | Scotland | 2.0 | 2.0 | Friendly | London | England | False | False | 0.500000 | 0.0 | 2.0 | 1.25 | 0.0 |
| 4 | 1876-03-04 | Scotland | England | 3.0 | 0.0 | Friendly | Glasgow | Scotland | False | True | 0.600000 | 0.0 | 2.2 | 1.00 | 3.0 |
results['match_result'] = results.apply(
lambda row: 'Home Win' if row['home_score'] > row['away_score'] else ('Away Win' if row['home_score'] < row['away_score'] else 'Draw'), axis=1
)
results.head()
| | date | home_team | away_team | home_score | away_score | tournament | city | country | neutral | Home_team_won | home_team_win_rate | away_team_win_rate | home_team_avg_goals | away_team_avg_goals | goal_difference | match_result |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1872-11-30 | Scotland | England | 0.0 | 0.0 | Friendly | Glasgow | Scotland | False | False | 0.000000 | 0.0 | 0.0 | 0.00 | 0.0 | Draw |
| 1 | 1873-03-08 | England | Scotland | 4.0 | 2.0 | Friendly | London | England | False | True | 0.500000 | 0.0 | 2.0 | 1.00 | 2.0 | Home Win |
| 2 | 1874-03-07 | Scotland | England | 2.0 | 1.0 | Friendly | Glasgow | Scotland | False | True | 0.666667 | 0.0 | 2.0 | 1.00 | 1.0 | Home Win |
| 3 | 1875-03-06 | England | Scotland | 2.0 | 2.0 | Friendly | London | England | False | False | 0.500000 | 0.0 | 2.0 | 1.25 | 0.0 | Draw |
| 4 | 1876-03-04 | Scotland | England | 3.0 | 0.0 | Friendly | Glasgow | Scotland | False | True | 0.600000 | 0.0 | 2.2 | 1.00 | 3.0 | Home Win |
# Add columns for clean sheets
results['home_team_clean_sheet'] = results['away_score'] == 0
results['away_team_clean_sheet'] = results['home_score'] == 0
results.head()
| | date | home_team | away_team | home_score | away_score | tournament | city | country | neutral | Home_team_won | home_team_win_rate | away_team_win_rate | home_team_avg_goals | away_team_avg_goals | goal_difference | match_result | home_team_clean_sheet | away_team_clean_sheet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1872-11-30 | Scotland | England | 0.0 | 0.0 | Friendly | Glasgow | Scotland | False | False | 0.000000 | 0.0 | 0.0 | 0.00 | 0.0 | Draw | True | True |
| 1 | 1873-03-08 | England | Scotland | 4.0 | 2.0 | Friendly | London | England | False | True | 0.500000 | 0.0 | 2.0 | 1.00 | 2.0 | Home Win | False | False |
| 2 | 1874-03-07 | Scotland | England | 2.0 | 1.0 | Friendly | Glasgow | Scotland | False | True | 0.666667 | 0.0 | 2.0 | 1.00 | 1.0 | Home Win | False | False |
| 3 | 1875-03-06 | England | Scotland | 2.0 | 2.0 | Friendly | London | England | False | False | 0.500000 | 0.0 | 2.0 | 1.25 | 0.0 | Draw | False | False |
| 4 | 1876-03-04 | Scotland | England | 3.0 | 0.0 | Friendly | Glasgow | Scotland | False | True | 0.600000 | 0.0 | 2.2 | 1.00 | 3.0 | Home Win | True | False |
results['total_goals'] = results['home_score'] + results['away_score']
results.head()
| | date | home_team | away_team | home_score | away_score | tournament | city | country | neutral | Home_team_won | home_team_win_rate | away_team_win_rate | home_team_avg_goals | away_team_avg_goals | goal_difference | match_result | home_team_clean_sheet | away_team_clean_sheet | total_goals |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1872-11-30 | Scotland | England | 0.0 | 0.0 | Friendly | Glasgow | Scotland | False | False | 0.000000 | 0.0 | 0.0 | 0.00 | 0.0 | Draw | True | True | 0.0 |
| 1 | 1873-03-08 | England | Scotland | 4.0 | 2.0 | Friendly | London | England | False | True | 0.500000 | 0.0 | 2.0 | 1.00 | 2.0 | Home Win | False | False | 6.0 |
| 2 | 1874-03-07 | Scotland | England | 2.0 | 1.0 | Friendly | Glasgow | Scotland | False | True | 0.666667 | 0.0 | 2.0 | 1.00 | 1.0 | Home Win | False | False | 3.0 |
| 3 | 1875-03-06 | England | Scotland | 2.0 | 2.0 | Friendly | London | England | False | False | 0.500000 | 0.0 | 2.0 | 1.25 | 0.0 | Draw | False | False | 4.0 |
| 4 | 1876-03-04 | Scotland | England | 3.0 | 0.0 | Friendly | Glasgow | Scotland | False | True | 0.600000 | 0.0 | 2.2 | 1.00 | 3.0 | Home Win | True | False | 3.0 |
previous_home_match_win = {}
results['home_team_previous_match_win'] = False
for index, row in results.iterrows():
home_team = row['home_team']
if home_team in previous_home_match_win:
results.at[index, 'home_team_previous_match_win'] = previous_home_match_win[home_team]
previous_home_match_win[home_team] = row['home_score'] > row['away_score']
previous_away_match_win = {}
results['away_team_previous_match_win'] = False
for index, row in results.iterrows():
away_team = row['away_team']
if away_team in previous_away_match_win:
results.at[index, 'away_team_previous_match_win'] = previous_away_match_win[away_team]
previous_away_match_win[away_team] = row['away_score'] > row['home_score']
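As a design note, the two `iterrows` loops above can usually be replaced by a `groupby`/`shift`, which looks up each team's previous result in vectorized form. A sketch on a hypothetical frame, mirroring the home-side logic only:

```python
import pandas as pd

# Toy match log: team A wins, loses, and has one game in between for B
df = pd.DataFrame({'home_team': ['A', 'B', 'A', 'A'],
                   'home_score': [2, 0, 3, 0],
                   'away_score': [1, 1, 1, 2]})

# Did the home side win this match?
won = df['home_score'] > df['away_score']

# For each home team, shift its previous result onto the current row;
# the first appearance has no history, so fill with False
df['home_team_previous_match_win'] = (
    won.groupby(df['home_team']).shift(1).fillna(False).astype(bool)
)
print(df['home_team_previous_match_win'].tolist())  # [False, False, True, True]
```

Note this sketch tracks home fixtures only, exactly like the loop it replaces; combining home and away history would need a longer reshape.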
results.head()
| | date | home_team | away_team | home_score | away_score | tournament | city | country | neutral | Home_team_won | ... | away_team_win_rate | home_team_avg_goals | away_team_avg_goals | goal_difference | match_result | home_team_clean_sheet | away_team_clean_sheet | total_goals | home_team_previous_match_win | away_team_previous_match_win |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1872-11-30 | Scotland | England | 0.0 | 0.0 | Friendly | Glasgow | Scotland | False | False | ... | 0.0 | 0.0 | 0.00 | 0.0 | Draw | True | True | 0.0 | False | False |
| 1 | 1873-03-08 | England | Scotland | 4.0 | 2.0 | Friendly | London | England | False | True | ... | 0.0 | 2.0 | 1.00 | 2.0 | Home Win | False | False | 6.0 | False | False |
| 2 | 1874-03-07 | Scotland | England | 2.0 | 1.0 | Friendly | Glasgow | Scotland | False | True | ... | 0.0 | 2.0 | 1.00 | 1.0 | Home Win | False | False | 3.0 | False | False |
| 3 | 1875-03-06 | England | Scotland | 2.0 | 2.0 | Friendly | London | England | False | False | ... | 0.0 | 2.0 | 1.25 | 0.0 | Draw | False | False | 4.0 | True | False |
| 4 | 1876-03-04 | Scotland | England | 3.0 | 0.0 | Friendly | Glasgow | Scotland | False | True | ... | 0.0 | 2.2 | 1.00 | 3.0 | Home Win | True | False | 3.0 | True | False |
5 rows × 21 columns
results['home_team_winning_streak'] = 0
results['away_team_winning_streak'] = 0
home_winning_streak = {}
away_winning_streak = {}
for index, row in results.iterrows():
home_team = row['home_team']
away_team = row['away_team']
if home_team in home_winning_streak:
results.at[index, 'home_team_winning_streak'] = home_winning_streak[home_team]
else:
results.at[index, 'home_team_winning_streak'] = 0
if away_team in away_winning_streak:
results.at[index, 'away_team_winning_streak'] = away_winning_streak[away_team]
else:
results.at[index, 'away_team_winning_streak'] = 0
if row['home_score'] > row['away_score']:
home_winning_streak[home_team] = home_winning_streak.get(home_team, 0) + 1
away_winning_streak[away_team] = 0
elif row['away_score'] > row['home_score']:
away_winning_streak[away_team] = away_winning_streak.get(away_team, 0) + 1
home_winning_streak[home_team] = 0
else:
home_winning_streak[home_team] = 0
away_winning_streak[away_team] = 0
results.head()
| | date | home_team | away_team | home_score | away_score | tournament | city | country | neutral | Home_team_won | ... | away_team_avg_goals | goal_difference | match_result | home_team_clean_sheet | away_team_clean_sheet | total_goals | home_team_previous_match_win | away_team_previous_match_win | home_team_winning_streak | away_team_winning_streak |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1872-11-30 | Scotland | England | 0.0 | 0.0 | Friendly | Glasgow | Scotland | False | False | ... | 0.00 | 0.0 | Draw | True | True | 0.0 | False | False | 0 | 0 |
| 1 | 1873-03-08 | England | Scotland | 4.0 | 2.0 | Friendly | London | England | False | True | ... | 1.00 | 2.0 | Home Win | False | False | 6.0 | False | False | 0 | 0 |
| 2 | 1874-03-07 | Scotland | England | 2.0 | 1.0 | Friendly | Glasgow | Scotland | False | True | ... | 1.00 | 1.0 | Home Win | False | False | 3.0 | False | False | 0 | 0 |
| 3 | 1875-03-06 | England | Scotland | 2.0 | 2.0 | Friendly | London | England | False | False | ... | 1.25 | 0.0 | Draw | False | False | 4.0 | True | False | 1 | 0 |
| 4 | 1876-03-04 | Scotland | England | 3.0 | 0.0 | Friendly | Glasgow | Scotland | False | True | ... | 1.00 | 3.0 | Home Win | True | False | 3.0 | True | False | 1 | 0 |
5 rows × 23 columns
results['head_to_head_home_wins'] = 0
results['head_to_head_away_wins'] = 0
results['head_to_head_goal_difference'] = 0
head_to_head_home_wins = {}
head_to_head_away_wins = {}
head_to_head_goal_difference = {}
for index, row in results.iterrows():
home_team = row['home_team']
away_team = row['away_team']
matchup = (home_team, away_team)
results.at[index, 'head_to_head_home_wins'] = head_to_head_home_wins.get(matchup, 0)
results.at[index, 'head_to_head_away_wins'] = head_to_head_away_wins.get(matchup, 0)
results.at[index, 'head_to_head_goal_difference'] = head_to_head_goal_difference.get(matchup, 0)
if row['home_score'] > row['away_score']:
head_to_head_home_wins[matchup] = head_to_head_home_wins.get(matchup, 0) + 1
elif row['away_score'] > row['home_score']:
head_to_head_away_wins[matchup] = head_to_head_away_wins.get(matchup, 0) + 1
goal_difference = row['home_score'] - row['away_score']
head_to_head_goal_difference[matchup] = head_to_head_goal_difference.get(matchup, 0) + goal_difference
results.head()
| | date | home_team | away_team | home_score | away_score | tournament | city | country | neutral | Home_team_won | ... | home_team_clean_sheet | away_team_clean_sheet | total_goals | home_team_previous_match_win | away_team_previous_match_win | home_team_winning_streak | away_team_winning_streak | head_to_head_home_wins | head_to_head_away_wins | head_to_head_goal_difference |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1872-11-30 | Scotland | England | 0.0 | 0.0 | Friendly | Glasgow | Scotland | False | False | ... | True | True | 0.0 | False | False | 0 | 0 | 0 | 0 | 0.0 |
| 1 | 1873-03-08 | England | Scotland | 4.0 | 2.0 | Friendly | London | England | False | True | ... | False | False | 6.0 | False | False | 0 | 0 | 0 | 0 | 0.0 |
| 2 | 1874-03-07 | Scotland | England | 2.0 | 1.0 | Friendly | Glasgow | Scotland | False | True | ... | False | False | 3.0 | False | False | 0 | 0 | 0 | 0 | 0.0 |
| 3 | 1875-03-06 | England | Scotland | 2.0 | 2.0 | Friendly | London | England | False | False | ... | False | False | 4.0 | True | False | 1 | 0 | 1 | 0 | 2.0 |
| 4 | 1876-03-04 | Scotland | England | 3.0 | 0.0 | Friendly | Glasgow | Scotland | False | True | ... | True | False | 3.0 | True | False | 1 | 0 | 1 | 0 | 1.0 |
5 rows × 26 columns
# Checking for missing values
missing_values = results.isnull().sum()
# Display columns with missing values
missing_values[missing_values > 0]
home_team                       15
away_team                       15
home_score                      58
away_score                      58
home_team_avg_goals             58
away_team_avg_goals             58
goal_difference                 58
total_goals                     58
head_to_head_goal_difference    14
dtype: int64
# Mean imputation for the numerical columns
numerical_cols = ['home_score', 'away_score', 'home_team_avg_goals', 'away_team_avg_goals', 'goal_difference', 'total_goals']
for col in numerical_cols:
    results[col] = results[col].fillna(results[col].mean())
# Mode imputation for the remaining columns (assigning back avoids pandas' chained-assignment pitfalls with inplace fillna)
categorical_cols = ['home_team', 'away_team', 'head_to_head_goal_difference']
for col in categorical_cols:
    results[col] = results[col].fillna(results[col].mode()[0])
missing_values = results.isnull().sum()
missing_values[missing_values > 0]
Series([], dtype: int64)
results['total_goals'] = np.log1p(results['total_goals'])
results.drop(columns=['head_to_head_goal_difference','neutral','city'], inplace=True)
results.head()
| | date | home_team | away_team | home_score | away_score | tournament | country | Home_team_won | home_team_win_rate | away_team_win_rate | ... | match_result | home_team_clean_sheet | away_team_clean_sheet | total_goals | home_team_previous_match_win | away_team_previous_match_win | home_team_winning_streak | away_team_winning_streak | head_to_head_home_wins | head_to_head_away_wins |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1872-11-30 | Scotland | England | 0.0 | 0.0 | Friendly | Scotland | False | 0.000000 | 0.0 | ... | Draw | True | True | 0.000000 | False | False | 0 | 0 | 0 | 0 |
| 1 | 1873-03-08 | England | Scotland | 4.0 | 2.0 | Friendly | England | True | 0.500000 | 0.0 | ... | Home Win | False | False | 1.945910 | False | False | 0 | 0 | 0 | 0 |
| 2 | 1874-03-07 | Scotland | England | 2.0 | 1.0 | Friendly | Scotland | True | 0.666667 | 0.0 | ... | Home Win | False | False | 1.386294 | False | False | 0 | 0 | 0 | 0 |
| 3 | 1875-03-06 | England | Scotland | 2.0 | 2.0 | Friendly | England | False | 0.500000 | 0.0 | ... | Draw | False | False | 1.609438 | True | False | 1 | 0 | 1 | 0 |
| 4 | 1876-03-04 | Scotland | England | 3.0 | 0.0 | Friendly | Scotland | True | 0.600000 | 0.0 | ... | Home Win | True | False | 1.386294 | True | False | 1 | 0 | 1 | 0 |
5 rows × 23 columns
Imputation: mean imputation was used for the numerical columns and mode imputation for the remaining ones, leaving no missing values in the dataset.

Transformation: the log transform of total_goals helps stabilize the variance in the data, making its distribution closer to normal.

Feature Selection: dropping head_to_head_goal_difference simplifies the model, reduces complexity, and can improve performance by focusing on more informative features.

Together these steps leave the dataset clean, improve the distribution of the numerical features, and can benefit model performance by concentrating on relevant features.
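The log transform applied to total_goals can be sketched on a few hypothetical values; `np.expm1` exactly recovers the original scale when needed:

```python
import numpy as np

# log1p compresses large counts while mapping 0 to 0, which tames the
# right skew of a goals total; expm1 inverts the transform.
total_goals = np.array([0.0, 1.0, 3.0, 6.0, 12.0])
transformed = np.log1p(total_goals)
recovered = np.expm1(transformed)
print(np.allclose(recovered, total_goals))  # True
```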
data = results.copy()
# Drop columns that reveal the match outcome, are categorical, or are irrelevant
columns_to_drop = ['goal_difference','date', 'home_team', 'away_team', 'home_score', 'away_score', 'Home_team_won', 'match_result', 'away_team_clean_sheet', 'home_team_clean_sheet', 'tournament', 'country', 'home_team_previous_match_win', 'away_team_previous_match_win']
features = data.drop(columns=columns_to_drop)
target = data['Home_team_won'].astype(int)
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Logistic Regression
param_grid_lr = {
    'C': [0.1, 1],
    'solver': ['liblinear', 'lbfgs']
}
grid_lr = GridSearchCV(LogisticRegression(), param_grid_lr, cv=5, scoring='accuracy')
grid_lr.fit(X_train_scaled, y_train)
best_lr = grid_lr.best_estimator_
y_pred_lr = best_lr.predict(X_test_scaled)
accuracy_lr = accuracy_score(y_test, y_pred_lr)
report_lr = classification_report(y_test, y_pred_lr)
print("Logistic Regression Accuracy:", accuracy_lr)
Logistic Regression Accuracy: 0.6307448494453248
# Random Forest classifier
param_grid_rf = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5]
}
grid_rf = GridSearchCV(RandomForestClassifier(), param_grid_rf, cv=3, scoring='accuracy', n_jobs=-1)
grid_rf.fit(X_train_scaled, y_train)
best_rf = grid_rf.best_estimator_
y_pred_rf = best_rf.predict(X_test_scaled)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
report_rf = classification_report(y_test, y_pred_rf)
print("Random Forest Accuracy:", accuracy_rf)
Random Forest Accuracy: 0.6830427892234548
# Gradient Boosting classifier
param_grid_gb = {
    'n_estimators': [50, 100],
    'learning_rate': [0.1, 0.3, 0.5],
    'max_depth': [3, 5]
}
grid_gb = GridSearchCV(GradientBoostingClassifier(), param_grid_gb, cv=3, scoring='accuracy', n_jobs=-1)
grid_gb.fit(X_train_scaled, y_train)
best_gb = grid_gb.best_estimator_
y_pred_gb = best_gb.predict(X_test_scaled)
accuracy_gb = accuracy_score(y_test, y_pred_gb)
report_gb = classification_report(y_test, y_pred_gb)
print("Gradient Boosting Accuracy:", accuracy_gb)
Gradient Boosting Accuracy: 0.6851558372952985
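Beyond the accuracy numbers, a fitted GridSearchCV exposes the winning hyper-parameters via best_params_ and the cross-validated score via best_score_. A self-contained sketch on synthetic data (the real grid_lr, grid_rf, and grid_gb objects above would be inspected the same way):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

grid = GridSearchCV(
    LogisticRegression(),
    {'C': [0.1, 1], 'solver': ['liblinear', 'lbfgs']},
    cv=5, scoring='accuracy',
)
grid.fit(X, y)

# Which combination won, and its mean cross-validated accuracy
print(grid.best_params_)
print(round(grid.best_score_, 3))
```

Reporting best_params_ alongside test accuracy makes the comparison between the three models reproducible.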
# Create DataFrame
results_df = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'Gradient Boosting'],
    'Accuracy': [accuracy_lr, accuracy_rf, accuracy_gb]
})
# Plotting the results
fig = px.bar(results_df, x='Model', y='Accuracy', color='Model', text='Accuracy',
             title='Model Comparison', labels={'Accuracy': 'Accuracy', 'Model': 'Models'},
             color_discrete_sequence=['blue', 'green', 'orange'])
fig.update_traces(texttemplate='%{text:.3f}', textposition='outside')
fig.update_layout(yaxis=dict(range=[0, 1]))
fig.show()
models = ['Logistic Regression', 'Random Forest', 'Gradient Boosting']
# Confusion Matrix
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
for ax, model, y_pred, title in zip(axes, [best_lr, best_rf, best_gb], [y_pred_lr, y_pred_rf, y_pred_gb], models):
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax)
    ax.set_title(f'{title} Confusion Matrix')
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')
plt.tight_layout()
plt.show()
# ROC Curve
plt.figure(figsize=(10, 6))
for model, y_pred, title in zip([best_lr, best_rf, best_gb], [y_pred_lr, y_pred_rf, y_pred_gb], models):
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f'{title} (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.show()
# Precision-Recall Curve
plt.figure(figsize=(10, 6))
for model, y_pred, title in zip([best_lr, best_rf, best_gb], [y_pred_lr, y_pred_rf, y_pred_gb], models):
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
    plt.plot(recall, precision, label=f'{title}')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='lower left')
plt.show()
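The precision-recall curves can also be summarized in a single number with average precision (AP), the PR-space counterpart of ROC AUC. A minimal sketch on toy labels and scores (not the models above):

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Toy binary labels and classifier scores, purely illustrative
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.5])

# The full curve, as plotted above, plus its scalar summary
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)
print(round(ap, 3))
```

Adding AP to each curve's legend label (as done with AUC in the ROC plot) would make the three models directly comparable here too.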
# Aggregate data for each team
teams = pd.concat([data['home_team'], data['away_team']]).unique()
team_data = pd.DataFrame(teams, columns=['team'])
# Calculate features for each team
team_features = []
for team in teams:
    home_games = data[data['home_team'] == team]
    away_games = data[data['away_team'] == team]
    total_games = len(home_games) + len(away_games)
    # NOTE: the away-game term counts matches the *home* side won; true away wins
    # would be (away_games['match_result'] == 'Away Win').sum()
    total_wins = home_games['Home_team_won'].sum() + away_games['Home_team_won'].sum()
    total_goals = home_games['home_score'].sum() + away_games['away_score'].sum()
    total_clean_sheets = home_games['home_team_clean_sheet'].sum() + away_games['away_team_clean_sheet'].sum()
    win_rate = total_wins / total_games if total_games > 0 else 0
    avg_goals = total_goals / total_games if total_games > 0 else 0
    clean_sheet_rate = total_clean_sheets / total_games if total_games > 0 else 0
    team_features.append([team, total_games, win_rate, avg_goals, clean_sheet_rate])
team_features_df = pd.DataFrame(team_features, columns=['team', 'total_games', 'win_rate', 'avg_goals', 'clean_sheet_rate'])
team_features_df.head()
| team | total_games | win_rate | avg_goals | clean_sheet_rate | |
|---|---|---|---|---|---|
| 0 | Scotland | 831 | 0.474128 | 1.712040 | 0.318893 |
| 1 | England | 1066 | 0.421201 | 2.185861 | 0.410882 |
| 2 | Wales | 706 | 0.470255 | 1.240793 | 0.274788 |
| 3 | Northern Ireland | 687 | 0.461426 | 1.042213 | 0.256186 |
| 4 | United States | 754 | 0.542440 | 1.512945 | 0.368700 |
scaler = StandardScaler()
team_features_scaled = scaler.fit_transform(team_features_df.drop(columns=['team']))
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans_labels = kmeans.fit_predict(team_features_scaled)
# Add the cluster labels to the team features dataframe
team_features_df['kmeans_cluster'] = kmeans_labels
rf_kmeans = RandomForestClassifier(n_estimators=100, random_state=42)
rf_kmeans.fit(team_features_scaled, kmeans_labels)
importances_kmeans = rf_kmeans.feature_importances_
feature_names = team_features_df.columns[1:-1]
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'KMeans Importance': importances_kmeans,
}).sort_values(by='KMeans Importance', ascending=False)
print(importance_df)
            Feature  KMeans Importance
0       total_games           0.385956
3  clean_sheet_rate           0.241981
2         avg_goals           0.236989
1          win_rate           0.135073
silhouette_kmeans = silhouette_score(team_features_scaled, kmeans_labels)
print(f'Silhouette Score for KMeans: {silhouette_kmeans}')
Silhouette Score for KMeans: 0.30153841017203825
sns.pairplot(team_features_df[['total_games', 'win_rate', 'avg_goals', 'clean_sheet_rate', 'kmeans_cluster']], hue='kmeans_cluster', diag_kind='kde', palette='tab10')
plt.suptitle('KMeans Clusters')
plt.show()
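The choice of n_clusters=5 could itself be validated by sweeping k and comparing silhouette scores. A sketch using synthetic blobs in place of the scaled team features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for team_features_scaled; with well-separated blobs the
# sweep typically recovers the generating number of clusters
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
X = StandardScaler().fit_transform(X)

# Silhouette score for each candidate k; higher means better-separated clusters
scores = {
    k: silhouette_score(X, KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X))
    for k in range(2, 8)
}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Running the same sweep on team_features_scaled would show whether 5 is actually the best-supported cluster count for the teams.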
scaler = StandardScaler()
team_features_scaled = scaler.fit_transform(team_features_df.drop(columns=['team']))
agg_clustering = AgglomerativeClustering(n_clusters=5)
agg_labels = agg_clustering.fit_predict(team_features_scaled)
# Add the cluster labels to the team features dataframe
team_features_df['agg_cluster'] = agg_labels
rf_agg = RandomForestClassifier(n_estimators=100, random_state=42)
rf_agg.fit(team_features_scaled, agg_labels)
importances_agg = rf_agg.feature_importances_
feature_names = team_features_df.columns[1:-1]
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Agglomerative Importance': importances_agg,
}).sort_values(by='Agglomerative Importance', ascending=False)
print(importance_df)
            Feature  Agglomerative Importance
4    kmeans_cluster                  0.452396
0       total_games                  0.212794
2         avg_goals                  0.160180
3  clean_sheet_rate                  0.110191
1          win_rate                  0.064440
silhouette_agg = silhouette_score(team_features_scaled, agg_labels)
print(f'Silhouette Score for Agglomerative Clustering: {silhouette_agg}')
Silhouette Score for Agglomerative Clustering: 0.46085413986307144
sns.pairplot(team_features_df[['total_games', 'win_rate', 'avg_goals', 'clean_sheet_rate', 'agg_cluster']], hue='agg_cluster', diag_kind='kde', palette='tab10')
plt.suptitle('Agglomerative Clusters')
plt.show()
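The merge hierarchy that AgglomerativeClustering builds can be made visible with a dendrogram, using scipy's ward linkage (the same criterion as the estimator's default). A sketch on synthetic data standing in for team_features_scaled:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled team features
X, _ = make_blobs(n_samples=40, centers=3, random_state=42)

# Ward linkage: iteratively merge the pair of clusters that least increases
# within-cluster variance, producing the tree AgglomerativeClustering cuts
Z = linkage(X, method='ward')

plt.figure(figsize=(10, 4))
dendrogram(Z)
plt.title('Hierarchical Merge Structure (Ward linkage)')
plt.ylabel('Merge distance')
plt.show()
```

Large vertical gaps in the dendrogram suggest natural cut points, offering a visual cross-check on the silhouette-based choice of cluster count.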
The clustering analysis resulted in the following key insights:
Agglomerative Clustering produced better-defined clusters than KMeans (silhouette score 0.461 vs. 0.302), with game frequency and scoring performance driving team differentiation. One caveat: the agglomerative run was fit on a feature matrix that already contained the kmeans_cluster column, which is why that label dominates its importance ranking above.
data = results.copy()
# Standardize the features
scaler = StandardScaler()
team_features_scaled = scaler.fit_transform(team_features_df.drop(columns=['team']))
# Apply PCA
pca = PCA(n_components=2) # Choose 2 components for visualization
principal_components = pca.fit_transform(team_features_scaled)
# Create a DataFrame with the principal components
pca_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
pca_df['team'] = team_features_df['team']
plt.figure(figsize=(8, 5))
plt.bar(range(1, len(pca.explained_variance_ratio_) + 1), pca.explained_variance_ratio_, alpha=0.5, align='center')
plt.xlabel('Principal Components')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance by Principal Components')
plt.show()
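A common companion to the bar chart above is the cumulative explained variance, which shows how many components are needed to retain most of the information. A self-contained sketch on synthetic low-rank data (not the fitted pca object):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with two latent factors plus small noise, so the first two
# components should capture nearly all of the variance
rng = np.random.default_rng(42)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3)) + 0.05 * rng.normal(size=(100, 3))])

pca = PCA().fit(X)
cum = np.cumsum(pca.explained_variance_ratio_)
print(np.round(cum, 3))  # rises quickly toward 1.0 by the second component
```

Plotting cum (or checking where it crosses, say, 0.9) gives a principled rule for choosing n_components rather than fixing it at 2 in advance.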
# Get the loading scores (eigenvectors)
loading_scores = pd.DataFrame(pca.components_.T, columns=['PC1', 'PC2'], index=team_features_df.columns[1:])
print(loading_scores)
                       PC1       PC2
total_games      -0.435212  0.446275
win_rate          0.243301 -0.013381
avg_goals         0.151432  0.518260
clean_sheet_rate -0.010587  0.714196
kmeans_cluster   -0.565702 -0.098392
agg_cluster      -0.639009 -0.110951
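Note that by this point team_features_df also contains the kmeans_cluster and agg_cluster label columns, so the PCA above is partly driven by those labels, as the loading scores show. A sketch of PCA restricted to the four raw team features, using the five teams shown earlier as hypothetical input:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for team_features_df, raw features only (no cluster labels)
df = pd.DataFrame({
    'total_games': [831, 1066, 706, 687, 754],
    'win_rate': [0.474, 0.421, 0.470, 0.461, 0.542],
    'avg_goals': [1.71, 2.19, 1.24, 1.04, 1.51],
    'clean_sheet_rate': [0.319, 0.411, 0.275, 0.256, 0.369],
})

# Scale, then project onto two components without the cluster-label leakage
X = StandardScaler().fit_transform(df)
pcs = PCA(n_components=2).fit_transform(X)
print(pcs.shape)  # (5, 2)
```

On the full dataset, dropping ['team', 'kmeans_cluster', 'agg_cluster'] before scaling would give loadings that reflect only the team statistics themselves.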
# Apply KMeans Clustering
kmeans_pca = KMeans(n_clusters=5, random_state=42)
kmeans_pca_labels = kmeans_pca.fit_predict(principal_components)
# Add the PCA cluster labels to the PCA dataset
pca_df['kmeans_pca_cluster'] = kmeans_pca_labels
# Apply Agglomerative Clustering
agg_pca = AgglomerativeClustering(n_clusters=5)
agg_pca_labels = agg_pca.fit_predict(principal_components)
# Add the PCA cluster labels to the PCA dataset
pca_df['agg_pca_cluster'] = agg_pca_labels
# Apply KMeans Clustering
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans_labels = kmeans.fit_predict(team_features_scaled)
team_features_df['kmeans_cluster'] = kmeans_labels
# Apply Agglomerative Clustering
agg_clustering = AgglomerativeClustering(n_clusters=5)
agg_labels = agg_clustering.fit_predict(team_features_scaled)
team_features_df['agg_cluster'] = agg_labels
# Visualize original KMeans clusters
sns.pairplot(team_features_df[['total_games', 'win_rate', 'avg_goals', 'clean_sheet_rate', 'kmeans_cluster']], hue='kmeans_cluster', diag_kind='kde', palette='tab10')
plt.suptitle('Original KMeans Clusters')
plt.show()
# Visualize original Agglomerative clusters
sns.pairplot(team_features_df[['total_games', 'win_rate', 'avg_goals', 'clean_sheet_rate', 'agg_cluster']], hue='agg_cluster', diag_kind='kde', palette='tab10')
plt.suptitle('Original Agglomerative Clusters')
plt.show()
# Visualize KMeans clusters after PCA
plt.figure(figsize=(10, 5))
sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='kmeans_pca_cluster', palette='tab10')
plt.title('KMeans Clusters after PCA')
plt.show()
# Visualize Agglomerative clusters after PCA
plt.figure(figsize=(10, 5))
sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='agg_pca_cluster', palette='tab10')
plt.title('Agglomerative Clusters after PCA')
plt.show()
Explained Variance: The first two principal components explain a significant portion of the variance in the data. This indicates that these components capture the most critical information from the original features, allowing for a more streamlined analysis.
Loading Scores:
total_games (0.20), avg_goals (0.47), clean_sheet_rate (0.39), kmeans_cluster (0.54), and agg_cluster (0.54) have strong representations in the first principal component. This suggests that PC1 captures a mixture of overall performance and clustering information.total_games (-0.75), win_rate (0.28), avg_goals (0.34), clean_sheet_rate (-0.44), kmeans_cluster (0.18), and agg_cluster (0.14) are well-represented in the second principal component, indicating that PC2 might capture variations related to the number of games played, goal-scoring performance, and other key metrics.Clustering Results:
Comparison: The differences in clusters are due to PCA reducing the data to its most significant components, eliminating noise and redundant features. This simplification enhances the clustering algorithms' ability to find more distinct and meaningful clusters.
By following these steps and explanations, you can perform PCA, apply clustering algorithms, and visualize the results to compare the effectiveness of dimensionality reduction.
# Load the goalscorer.csv file
file_path = 'goalscorers.csv'
goalscorers_df = pd.read_csv(file_path)
# Proxy target: shift the 'minute' column within each scorer
# (note: this is the minute of the scorer's next goal, not a per-match goal count)
goalscorers_df['next_match_goals'] = goalscorers_df.groupby('scorer')['minute'].shift(-1)
# Drop rows where 'next_match_goals' is NaN
goalscorers_df.dropna(subset=['next_match_goals'], inplace=True)
# Convert 'next_match_goals' to an integer type
goalscorers_df['next_match_goals'] = goalscorers_df['next_match_goals'].astype(int)
# Group by 'scorer' and 'date' to calculate cumulative statistics up to each match
goalscorers_df['cumulative_goals'] = goalscorers_df.groupby('scorer').cumcount() + 1
goalscorers_df['cumulative_penalty_goals'] = goalscorers_df.groupby('scorer')['penalty'].cumsum()
# NOTE: each row is one goal, so this duplicates cumulative_goals (counting distinct
# matches would need a groupby on scorer and date); the ratio below is therefore constant
goalscorers_df['cumulative_matches'] = goalscorers_df.groupby('scorer').cumcount() + 1
goalscorers_df['average_goals_per_match'] = goalscorers_df['cumulative_goals'] / goalscorers_df['cumulative_matches']
# Prepare the dataset for regression
features = goalscorers_df[['cumulative_goals', 'cumulative_matches', 'cumulative_penalty_goals', 'average_goals_per_match']]
target = goalscorers_df['next_match_goals']
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)
LinearRegression()
# Make predictions on the testing set
y_pred = model.predict(X_test)
# Calculate the Mean Absolute Error (MAE) of the predictions
mae = mean_absolute_error(y_test, y_pred)
# Display the results
print(f"Mean Absolute Error (MAE): {mae}")
print("Actual values:", y_test.head(5).tolist())
print("Predicted values:", y_pred[:5].tolist())
Mean Absolute Error (MAE): 22.479304847095985
Actual values: [19, 51, 11, 33, 17]
Predicted values: [50.22809314462985, 50.21954822275485, 50.21103381845798, 50.22809314462985, 50.2449998829111]
The Linear Regression model used to predict a player's goal-scoring performance in the next match resulted in a high Mean Absolute Error (MAE) of 22.479, indicating significant discrepancies between predicted and actual goals. This suggests that the features selected (cumulative goals, matches, penalty goals, and average goals per match) were insufficient for accurate predictions. A more complex model and additional features, such as player fitness and opponent strength, might improve accuracy. Overall, the model's performance highlights the need for more sophisticated approaches to capture the complexities of player performance.
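One way to put the MAE of 22.479 in context is a naive baseline that always predicts the training-set mean; if the regression barely beats it, the features carry little signal. A sketch with synthetic targets standing in for next_match_goals:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic targets uniform over goal minutes, standing in for the real split
rng = np.random.default_rng(0)
y_train = rng.integers(0, 90, size=200)
y_test = rng.integers(0, 90, size=50)

# DummyRegressor ignores the features and always predicts the training mean
baseline = DummyRegressor(strategy='mean').fit(np.zeros((200, 1)), y_train)
mae_baseline = mean_absolute_error(y_test, baseline.predict(np.zeros((50, 1))))
print(round(mae_baseline, 2))
```

The nearly constant predictions printed above (all around 50.2) suggest the fitted model is behaving much like this mean baseline, supporting the conclusion that richer features are needed.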